Section: New Results

Matching and localization

Participants: Marie-Odile Berger, Vincent Gaudilliere, Antoine Fond, Pierre Rolin, Gilles Simon, Frédéric Sur.

Pose initialization

Estimating the pose of a camera from a model of the scene is a challenging problem when the camera is in a position not covered by the views used to build the model, because feature matching is difficult in such a situation. Several viewpoint simulation techniques have recently been proposed in this context. They generally come with a high computational cost, are limited to specific scenes such as urban environments or object-centered scenes, or need an initial guess for the pose. In his PhD thesis [12], P. Rolin proposed a viewpoint simulation method suited to most scenes and query views. Two major problems were addressed: the positioning of the virtual viewpoints with respect to the scene, and the synthesis of geometrically consistent patches. Experimental results showed that patch synthesis dramatically improves pose accuracy in cases of difficult registration, at a limited additional computational cost.
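As background for the patch-synthesis idea, warping an image patch lying on a known scene plane to a virtual viewpoint can be done with the standard plane-induced homography. The sketch below is a generic illustration of that formula, not the method of [12]; the function name, conventions (plane n.X + d = 0 in camera-1 coordinates, X2 = R X1 + t) and calibration values are ours.

```python
import numpy as np

def plane_induced_homography(K1, K2, R, t, n, d):
    """Homography mapping pixels of camera 1 to pixels of camera 2 for
    scene points on the plane n.X + d = 0 (n unit normal, both expressed
    in camera-1 coordinates).  Textbook formula (Hartley & Zisserman):
    H = K2 (R - t n^T / d) K1^{-1}."""
    H = K2 @ (R - np.outer(t, n) / d) @ np.linalg.inv(K1)
    return H / H[2, 2]  # normalize so H[2,2] = 1

# Illustrative check: a point on the plane Z = 5, seen by two cameras.
K = np.array([[500.0, 0, 320], [0, 500, 240], [0, 0, 1]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])
n, d = np.array([0.0, 0.0, 1.0]), -5.0        # plane Z = 5
H = plane_induced_homography(K, K, R, t, n, d)
```

A patch around a model point on that plane can then be resampled through H to simulate its appearance from the virtual viewpoint before feature extraction.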

Vanishing point detection

Accurate detection of vanishing points (VPs) is a prerequisite for many computer vision problems such as camera self-calibration, single-view structure recovery, video compass, robot navigation and augmented reality, among many others. We are interested in VP detection from uncalibrated monocular images. Since the projections of any two parallel 3-D lines intersect in a VP, detection amounts to grouping line segments, a difficult problem that often yields a large number of spurious VPs. However, many tasks in computer vision, including the examples mentioned above, only require detecting the vertical (so-called zenith) VP and two or more horizontal VPs. In that case, many spurious VPs can be avoided by first detecting the zenith and the horizon line (HL), and then constraining the horizontal VPs to lie on the HL. The zenith is generally easy to detect, as many lines converge towards that point in man-made environments. Until recently, however, the HL was itself detected as an alignment of VPs, which led to a "chicken-and-egg" problem.
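For reference, a VP candidate is simply the intersection, in homogeneous coordinates, of the supporting lines of two image segments; grouping many segments around such candidates is where the spurious detections arise. A minimal sketch (function names and coordinates are illustrative):

```python
import numpy as np

def segment_line(p, q):
    """Homogeneous line (a, b, c) through segment endpoints p and q
    (pixel coordinates), via the cross product of the two points."""
    return np.cross([*p, 1.0], [*q, 1.0])

def vanishing_point(seg1, seg2):
    """VP candidate = intersection of the two supporting lines.
    Returns None when the lines are (numerically) parallel."""
    vp = np.cross(segment_line(*seg1), segment_line(*seg2))
    return vp[:2] / vp[2] if abs(vp[2]) > 1e-9 else None
```

In practice a robust detector must score such candidates against all segments, which is exactly where a large number of accidental intersections produces spurious VPs.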

Last year, we showed that, assuming the HL lies inside the image boundaries, it can usually be detected as an alignment of oriented line segments. This comes from the fact that any horizontal line segment at the height of the camera's optical center projects onto the HL regardless of its 3-D direction. In practice, doors, windows and floor separation lines, but also man-made objects such as cars, road signs and street furniture, are often placed at eye level, so that alignments of oriented line segments around the HL are indeed observed in most images of urban or indoor scenes. This allowed us to propose a new VP detection method that is fast and easy to implement, but only middle-ranked in terms of accuracy. This year, we cast the HL detection into an a contrario framework. This transposition, along with other improvements, yields top-ranked results in terms of both computation speed and HL accuracy, together with more relevant VPs than the previous top-ranked methods. This work has been submitted to CVPR 2018 (IEEE Conference on Computer Vision and Pattern Recognition).
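The a contrario methodology (in the spirit of Desolneux et al.) scores a candidate structure by a Number of False Alarms: the expected number of equally good candidates under a background noise model, with detection when the NFA falls below 1. The sketch below shows the generic binomial-tail form of such a score for a candidate line; the parameters and the precise model used in our submission are not reproduced here.

```python
from math import comb

def nfa(n, k, p, n_tests):
    """Number of False Alarms for observing at least k of n segments
    consistent with a candidate line, each consistent with probability p
    under the background model, among n_tests candidate lines.
    Detection (a 'meaningful' alignment) when nfa(...) < 1."""
    tail = sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))  # binomial tail P[X >= k]
    return n_tests * tail
```

The appeal of this framework is that the single threshold NFA < 1 replaces hand-tuned per-image thresholds on vote counts.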

Facade detection and localization

Planar building facades are semantically meaningful city-scale landmarks. Such landmarks are essential for localization and guidance tasks in GPS-denied areas, which are common in urban environments. Facade detection is also key in augmented reality systems that annotate prominent features in the user's view. We proposed in [19] a novel object-proposal method specific to building facades. We defined new image cues that measure typical facade characteristics such as semantics, symmetry and repetitions, and combined them to quickly generate a small number of facade candidates in urban environments. We showed that our method outperforms state-of-the-art object-proposal techniques for this task on the 1000 images of the Zurich Building Database. We also demonstrated the interest of this procedure for augmented reality through facade recognition and camera pose initialization: in a very time-efficient pipeline, we classify the candidates and match them to a database of reference facades using CNN-based descriptors. This approach proved more robust to severe viewpoint changes and occlusions than standard object recognition methods.
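Matching a candidate facade to the reference database with CNN-based descriptors typically reduces to a nearest-neighbour search in descriptor space. The following is a generic cosine-similarity sketch of that step; the descriptor dimension, similarity measure and function names are illustrative, not the exact pipeline of [19].

```python
import numpy as np

def match_facade(query_desc, ref_descs):
    """Return the index of the reference facade whose CNN descriptor is
    most similar (cosine similarity) to the query descriptor, along with
    the similarity score.  ref_descs: one descriptor per row."""
    q = query_desc / np.linalg.norm(query_desc)
    R = ref_descs / np.linalg.norm(ref_descs, axis=1, keepdims=True)
    sims = R @ q                      # cosine similarities to all references
    best = int(np.argmax(sims))
    return best, float(sims[best])
```

Rejecting matches whose best score falls below a threshold (or whose ratio to the second-best is too close to 1) is the usual way to keep the recognition stage robust.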

We are currently investigating ways to perform registration from this set of facade proposals. As point-based approaches may fail to match the image to the model under changing illumination conditions, we rely on semantic segmentation to improve the accuracy of this initial registration. Registration is refined within an Expectation-Maximization framework. In particular, we introduce a Bayesian model that exploits both a prior semantic segmentation and the geometric structure of the reference facade, modeled by L_p Gaussian mixtures. This work has been submitted to CVPR 2018.
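As an illustration of the E-step in such an EM scheme, per-point responsibilities can weight a Gaussian likelihood by a per-component prior (here standing in for the semantic segmentation prior). This is a generic mixture-model sketch, not the actual Bayesian model of the submission; all names and shapes are ours.

```python
import numpy as np

def e_step_responsibilities(points, priors, means, covs):
    """E-step sketch: posterior responsibility of each mixture component k
    for each 2-D point, with the Gaussian likelihood weighted by a
    per-component prior (e.g. derived from semantic segmentation)."""
    K = len(means)
    resp = np.zeros((len(points), K))
    for k in range(K):
        diff = points - means[k]
        inv = np.linalg.inv(covs[k])
        mah = np.einsum('ni,ij,nj->n', diff, inv, diff)   # Mahalanobis^2
        norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(covs[k])))
        resp[:, k] = priors[k] * norm * np.exp(-0.5 * mah)
    return resp / resp.sum(axis=1, keepdims=True)         # rows sum to 1
```

The M-step would then re-estimate the registration parameters from these responsibilities, alternating until convergence.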

AR in industrial environments

Industrial environments typically abound with textureless objects and specular surfaces, making it difficult to capture enough features and build accurate 3-D models for camera pose estimation with traditional 2-D/3-D matching-based approaches. Moreover, as users usually inspect industrial objects with free motions, recent CNN-based approaches can easily fail if the training data is not properly collected (e.g. does not cover enough views around the objects) and augmented (e.g. over-zoomed and over-augmented). To address these challenges, we presented a novel protocol for six-degrees-of-freedom (6-DOF) camera pose learning and estimation without any 3-D reconstruction or matching process. In particular, we proposed a visually controllable method to collect sufficient training images and their 6-DOF camera poses from different views and camera-object distances. Building upon this, we proposed a transfer learning scheme to train convolutional neural networks to detect objects and estimate the coarse camera pose from a single RGB image in an end-to-end manner. Experiments show that the trained network estimates each camera pose in about 5 ms with approximately 13.3 mm and 4.8 deg accuracy, which is sufficient for training or maintenance tasks in industrial environments.
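The reported figures correspond to the standard pose error metrics: Euclidean distance between translations and the geodesic angle between rotations. A minimal sketch of their computation (assuming 3x3 rotation matrices and translation vectors; the function name is ours):

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Translation error (in the unit of t, e.g. mm) and geodesic rotation
    error (in degrees) between an estimated and a ground-truth camera pose,
    the kind of metrics behind figures such as 13.3 mm / 4.8 deg."""
    t_err = np.linalg.norm(t_est - t_gt)
    # angle of the relative rotation R_est^T R_gt, from its trace
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return t_err, r_err
```

Clipping the cosine to [-1, 1] guards against round-off before the arccos, which otherwise produces NaNs for near-identical rotations.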

This work has been submitted to WACV 2018 (IEEE Winter Conf. on Applications of Computer Vision), and an extended version to TVCG (IEEE Transactions on Visualization and Computer Graphics).